Introduction

Here we look at using InferCNV to identify tumor and normal cells in SCPCS000490.

Before running this notebook, InferCNV is run using the run-infercnv.R script. This script runs InferCNV once with Endothelial cells as the reference cells and once without any cells as the reference.

Here, we look the copy number variations found reported by InferCNV and use that copy number information to call cells as tumor or normal cells. We then compare these classifications to manual classifications and annotations from SingleR and CellAssign.

Setup

suppressPackageStartupMessages({
  # load required packages
  library(SingleCellExperiment)
  library(ggplot2)
  library(dplyr)
})

knitr::opts_chunk$set(
  dev = "jpeg"
)
# The base path for the OpenScPCA repository, found by its (hidden) .git directory
repository_base <- rprojroot::find_root(rprojroot::is_git_root)

# The current data directory, found within the repository base directory
data_dir <- file.path(repository_base, "data", "current")
sample_dir <- file.path(data_dir, "SCPCP000015", params$sample_id)

# The path to this module
module_base <- file.path(repository_base, "analyses", "cell-type-ewings")
# source in helper functions
jaccard_functions <- file.path(module_base, "scripts", "utils", "jaccard-functions.R")
source(jaccard_functions)
# Input files
sce_filename <- glue::glue("{params$library_id}_processed.rds")
sce_file <- file.path(sample_dir, sce_filename)

# png ref files 
png_file <-file.path(params$infercnv_dir, glue::glue("{params$library_id}_infercnv.png"))

# with ref cnv metadata file 
cnv_metadata_file <- file.path(params$infercnv_dir, glue::glue("{params$library_id}_cnv-metadata.tsv"))

# full infercnv object with information on genes/ chr to use later 
final_cnv_obj_file <- file.path(params$infercnv_dir, glue::glue("{params$library_id}_cnv-obj.rds"))

# output classifications file 
infercnv_results_file <- file.path(params$results_dir, glue::glue("{params$library_id}_infercnv-classifications.tsv"))
# make sure files exist
stopifnot(
  "SCE file is missing" = file.exists(sce_file),
  "PNG file with heatmap is missing" = file.exists(png_file),
  "CNV metadata file is missing" = file.exists(cnv_metadata_file),
  "Final InferCNV object is missing" = file.exists(final_cnv_obj_file),
  "Marker gene classifications are missing" = file.exists(params$marker_gene_classification)
)
# read in sce object
sce <- readr::read_rds(sce_file)

# read in cnv metadata 
# use read.delim instead of readr because we want the rownames to be read in properly 
cnv_metadata_df <- read.delim(cnv_metadata_file) |> 
  tibble::rownames_to_column("barcodes")

# read in tumor normal classifications 
manual_classifications_df <- readr::read_tsv(params$marker_gene_classification)
## Rows: 4290 Columns: 2
## ── Column specification ─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
## Delimiter: "\t"
## chr (2): barcodes, marker_gene_classification
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# check if marker gene annotations are present 
if(all(is.na(manual_classifications_df$marker_gene_classification))){
  has_marker_gene <- FALSE
  message("No annotations were available using only marker gene expression. 
          Any plots comparing InferCNV to marker gene annotation will be skipped.")
} else {
  has_marker_gene <- TRUE 
}

# read in cnv object 
final_cnv_obj <- readr::read_rds(final_cnv_obj_file)

InferCNV Heatmap

Endothelial cells as reference

Below is the heatmap produced by InferCNV showing the CNVs for each cell using normal cells as a reference. Each row is a cell and each column is a gene. The top group contains all cells that were used as the reference group and the bottom is all other cells. The color indicates the relative expression intensity of each gene with the darker colors either indicating a gain of expression or loss of expression compared to the reference cells.

From the InferCNV documentation:

The normal cells in the top heatmap define baseline expression for genes in normal cells. This baseline distribution of normal gene expression is subtracted from both the normal cells from which was defined as well as the tumor cells. Afterwards, the normal cell expression heatmap should be largely devoid of signal, with the exception of certain outlier gene expression values in certain cells. Removing this baseline of normal expression signal from the tumor cells should reveal those chromosomal regions that have significantly more or less expression than the normal cells, highlighting likely amplified or deleted whole chromosomes or large chromosomal regions.

The dendogram on the left side is hierarchical clustering of all non-reference cells. The color annotation next to it indicates subdivisions of the dendogram by cutting it in k_obs_groups groups, with the default k_obs_groups = 1.

InferCNV Heatmap - Endothelial cells as reference
InferCNV Heatmap - Endothelial cells as reference

Using CNVs to annotate tumor cells

InferCNV uses an HMM-based model to predict whether or not a copy number variation (either loss or gain) is present for each chromosome in each cell. For each predicted CNV region, a posterior probability is calculated for that CNV for each cell.

From the InferCNV documentation:

CNV regions with mean posterior probabilities of being normal (no CNV) that are above a maximum threshold are removed as likely false positive predictions.

Before we can make any plots we need to add some metadata that contains CNV information on a per cell level to the cell metadata.

# make df for plotting
coldata_df <- sce |> 
  scuttle::makePerCellDF(use.dimred = "UMAP") |> 
  # replace UMAP.1 with UMAP1
  dplyr::rename_with(
        \(x) stringr::str_replace(x, "^UMAP\\.", "UMAP")
      ) |> 
  # add in per cell cnv information 
  dplyr::left_join(cnv_metadata_df, by = c("barcodes"))

# remove MT and GL chr, comments in docs recommend that these are not useful and can be inaccurate 
chr_columns_to_keep <- which(!stringr::str_detect(names(coldata_df), "chrMT|chrGL"))
all_cnv_df <- coldata_df[, chr_columns_to_keep]

The metadata that we added includes the following columns for each chromosome:

The has columns are boolean columns indicating if that cell has presence of a CNV, loss, or duplication event. If has_cnv is TRUE if either has_loss or has_dupli is TRUE.

See the comments in this issue explaining what the contents of the columns are.

One approach to classifying tumor cells is to look for cells that have a lot of CNVs. Let’s start by just summing up the total number of chromosomes with CNVs per cell and plotting that.

# grab any columns that have has_cnv 
has_cnv_cols <- names(all_cnv_df)[startsWith(colnames(all_cnv_df), "has_cnv")]

# don't include the normal cells that were provided as reference in scaling 
cnv_df <- all_cnv_df |>
  # get total number of chromosomes with cnvs per cell and then scale that number 
  dplyr::mutate(
    has_cnv = ifelse(
      stringr::str_detect(subcluster, "reference"), 
      NA_real_, 
      rowSums(across(has_cnv_cols))
    ))
## Warning: There was 1 warning in `dplyr::mutate()`.
## ℹ In argument: `has_cnv = ifelse(...)`.
## Caused by warning:
## ! Using an external vector in selections was deprecated in tidyselect 1.1.0.
## ℹ Please use `all_of()` or `any_of()` instead.
##   # Was:
##   data %>% select(has_cnv_cols)
## 
##   # Now:
##   data %>% select(all_of(has_cnv_cols))
## 
## See <https://tidyselect.r-lib.org/reference/faq-external-vector.html>.
# plot the total number of CNVs
ggplot(cnv_df, aes(x = UMAP1, y= UMAP2, color = has_cnv)) +
  geom_point(alpha = 0.5, size = 0.5) +
  theme_bw() +
  scale_color_viridis_c()

Those cells in gray (bottom left) are cells used as a reference. If we were to use total CNVs we would need some sort of cut off for the number. Let’s look at the distribution of total CNVs to see if we can identify a cut off.

ggplot(cnv_df, aes(x = has_cnv)) +
  geom_freqpoly(binwidth = 1, center = 0) +
  theme_bw() +
  labs(x = "Total number of CNVs")
## Warning: Removed 497 rows containing non-finite outside the scale range
## (`stat_bin()`).

We will call any tumor cell as those with a total number of cnvs greater than the mean(number of cnvs).

# get mean value for has_cnv, excluding any NAs
mean_cnv <- mean(cnv_df$has_cnv, na.rm=TRUE)

# add cnv classification 
cnv_df <- cnv_df |> 
  dplyr::mutate(cnv_classification = dplyr::case_when(is.na(has_cnv) ~ "Reference",
                                                      has_cnv > mean_cnv ~ "Tumor",
                                                      has_cnv <= mean_cnv ~ "Normal"))

ggplot(cnv_df, aes(x = UMAP1, y= UMAP2, color = cnv_classification)) +
  geom_point(alpha = 0.5, size = 0.5) +
  theme_bw() 

Another option is to use the scaled proportion of CNVs which tells us the proportion of genes on that chromosome that are part of the CNV.

proportion_cols <- names(cnv_df)[startsWith(colnames(cnv_df), "proportion_scaled_cnv")]

# get total number of genes per chromosome to use in weighted mean 
chr_weights <- final_cnv_obj@gene_order |> 
  as.data.frame() |> 
  dplyr::count(chr) |> 
  # only keep chr 1-22, drops MT and GL chr
  dplyr::filter(chr %in% glue::glue("chr{seq(1,22)}")) |> 
  dplyr::pull(n)

# get mean proportion of cnvs per cell weighted by chromsome size (number of genes)
cnv_df <- cnv_df |> 
  dplyr::rowwise() |> 
  # calculate weighted mean, excluding normal cells 
  dplyr::mutate(
    mean_proportion = ifelse(
      !is.na(has_cnv), 
      weighted.mean(across(proportion_cols), chr_weights), 
      NA)) |> 
  dplyr::ungroup()
## Warning: There was 1 warning in `dplyr::mutate()`.
## ℹ In argument: `mean_proportion = ifelse(...)`.
## ℹ In row 1.
## Caused by warning:
## ! Using an external vector in selections was deprecated in tidyselect 1.1.0.
## ℹ Please use `all_of()` or `any_of()` instead.
##   # Was:
##   data %>% select(proportion_cols)
## 
##   # Now:
##   data %>% select(all_of(proportion_cols))
## 
## See <https://tidyselect.r-lib.org/reference/faq-external-vector.html>.
ggplot(cnv_df, aes(x = UMAP1, y= UMAP2, color = mean_proportion)) +
  geom_point(alpha = 0.5, size = 0.5) +
  theme_bw() +
  scale_color_viridis_c()

ggplot(cnv_df, aes(x = mean_proportion)) +
  geom_density() +
  theme_bw()
## Warning: Removed 497 rows containing non-finite outside the scale range
## (`stat_density()`).

Let’s see how this matches up with the tumor/normal classifications we defined just using total CNVs.

In the below plot we label all non-reference cells as either tumor or normal. Cells that are NA are those that were already classified as reference cells.

ggplot(cnv_df, aes(x = mean_proportion, color = cnv_classification)) +
  geom_density() +
  theme_bw()
## Warning: Removed 497 rows containing non-finite outside the scale range
## (`stat_density()`).

Below we will call tumor cells based on those that have a mean scaled proportion higher than the mean of all proportions.

# get mean value for mean_proportion, excluding any NAs
all_mean_proportion <- mean(cnv_df$mean_proportion, na.rm = TRUE)


# add cnv classification 
cnv_df <- cnv_df |> 
  dplyr::mutate(cnv_mean_classification = dplyr::case_when(is.na(mean_proportion) ~ "Reference",
                                                           mean_proportion > all_mean_proportion ~ "Tumor",
                                                           mean_proportion <= all_mean_proportion ~ "Normal"))

ggplot(cnv_df, aes(x = UMAP1, y= UMAP2, color = cnv_mean_classification)) +
  geom_point(alpha = 0.5, size = 0.5) +
  theme_bw() 

For the rest of the notebook, we will use the classifications of tumor cells based on the mean proportion of total CNV occurrences.

Compare InferCNV annotations to manual annotations

Below we can compare these annotations from InferCNV to classifying tumor cells manually with marker genes.

all_classifications_df <- cnv_df |> 
  dplyr::left_join(manual_classifications_df) |>
  # relevel classifications for confusion matrix 
  dplyr::filter(cnv_mean_classification != "Reference") |> 
  dplyr::mutate(
    cnv_classification = forcats::fct_relevel(cnv_mean_classification, "Tumor"),
    marker_gene_classification = forcats::fct_relevel(marker_gene_classification, "Tumor")
  )
## Joining with `by = join_by(barcodes)`
caret::confusionMatrix(
  table(
    all_classifications_df$cnv_classification,
    all_classifications_df$marker_gene_classification
  )
)
## Confusion Matrix and Statistics
## 
##         
##          Tumor Normal
##   Tumor   2276    435
##   Normal    29   1053
##                                           
##                Accuracy : 0.8777          
##                  95% CI : (0.8668, 0.8879)
##     No Information Rate : 0.6077          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.7304          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.9874          
##             Specificity : 0.7077          
##          Pos Pred Value : 0.8395          
##          Neg Pred Value : 0.9732          
##              Prevalence : 0.6077          
##          Detection Rate : 0.6001          
##    Detection Prevalence : 0.7147          
##       Balanced Accuracy : 0.8475          
##                                           
##        'Positive' Class : Tumor           
## 

We can also compare to the existing annotations obtained using SingleR and CellAssign as part of scpca-nf.

celltype_columns <- c(
  "singler_celltype_annotation",
  "cellassign_celltype_annotation"
)

# create jaccard matrices for SingleR and CellAssign compared to aneuploid/diploid 
jaccard_matrices <- celltype_columns |>
  purrr::map(\(name) {
    make_jaccard_matrix(
      cnv_df,
      "cnv_mean_classification",
      name
    )
  }) |> 
  purrr::set_names("SingleR", "CellAssign")
# Set heatmap padding option
heatmap_padding <- 0.2
ComplexHeatmap::ht_opt(TITLE_PADDING = grid::unit(heatmap_padding, "in"))
# list of heatmaps looking at SingleR/ CellAssign vs tumor/normal 
heatmap <- jaccard_matrices |>
  purrr::imap(
    \(celltype_mat, celltype_method) {
      ComplexHeatmap::Heatmap(
        t(celltype_mat), # transpose because matrix rows are in common & we want a vertical arrangement
        col = circlize::colorRamp2(c(0, 1), colors = c("white", "darkslateblue")),
        border = TRUE,
        ## Row parameters
        cluster_rows = TRUE,
        row_title = celltype_method,
        row_title_gp = grid::gpar(fontsize = 12),
        row_title_side = "left",
        row_names_side = "left",
        row_dend_side = "right",
        row_names_gp = grid::gpar(fontsize = 10),
        ## Column parameters
        cluster_columns = FALSE,
        column_title = "",
        column_title_gp = grid::gpar(fontsize = 12),
        column_names_side = "bottom",
        column_names_gp = grid::gpar(fontsize = 10),
        column_names_rot = 90,
        ## Legend parameters
        heatmap_legend_param = list(
          title = "Jaccard index",
          direction = "vertical",
          legend_width = unit(1.5, "in")
        ),
        show_heatmap_legend = celltype_method == "SingleR",
      )
    }) |>
  # concatenate vertically into HeatmapList object
  purrr::reduce(ComplexHeatmap::`%v%`) |>
  ComplexHeatmap::draw(
    heatmap_legend_side = "right",
    # add a margin to the heatmap so labels don't get cut off
    padding = unit(c(2, 20, 2, 2), "mm")
  )

Check for known CNVs in Ewing’s

There are a few known copy number variations in Ewing’s sarcoma:

Although these are the most frequent, there are patients who do not have any of these alterations and patients that only have some of these alterations. See Tirode et al., and Crompton et al..

First we will compare the proportions across each chromosome between cells we have labeled as tumor or normal. Then we will look specifically at the chromosomes commonly affected in Ewing sarcoma.

# prepare a data frame with one row per chromosome per cell for plotting
proportion_df <- cnv_df |> 
  # select only relevant columns for plotting
  dplyr::select(barcodes, UMAP1, UMAP2, cnv_mean_classification, all_of(proportion_cols)) |> 
  tidyr::pivot_longer(cols = all_of(proportion_cols),
    names_to = "chromosome",
    values_to = "proportion") |> 
  # remove extra string in front of chr 
  dplyr::mutate(
    chromosome = stringr::str_remove(chromosome, "proportion_scaled_cnv_chr"),
    # turn into numeric for easy ordering in plot 
    chromosome = as.numeric(chromosome)
  )
ggplot(proportion_df, aes(x = proportion, color = cnv_mean_classification)) +
  geom_density(bounds = c(0,1), bw = 0.05) +
  theme_bw() +
  facet_wrap(vars(chromosome), ncol = 3)
## Warning: Some data points are outside of `bounds`. Removing them.
## Some data points are outside of `bounds`. Removing them.
## Some data points are outside of `bounds`. Removing them.

Below we plot only Chr1, Chr8, Chr12, and Chr16 and look at the proportion on a UMAP.

ewings_chr_df <- proportion_df |> 
  dplyr::filter(chromosome %in% c(1,8,12,16))

ggplot(ewings_chr_df, aes(x = UMAP1, y=UMAP2 , color = proportion)) +
  geom_point(alpha = 0.5, size = 0.5) +
  theme_bw() +
  facet_wrap(vars(chromosome)) +
  scale_color_viridis_c()

Export results

We will save the tumor/normal classifications using both the sum and proportion.

# get a dataframe with barcodes, sum_classification and cluster_classification 
celltypes_df <- cnv_df |> 
  dplyr::select(
    barcodes,
    cnv_proportion_classification = cnv_mean_classification,
    cnv_sum_classification = cnv_classification
  )

readr::write_tsv(celltypes_df, infercnv_results_file)

Session Info

# record the versions of the packages used in this analysis and other environment information
sessionInfo()
## R version 4.4.0 (2024-04-24)
## Platform: x86_64-apple-darwin20
## Running under: macOS Ventura 13.6.7
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/4.4-x86_64/Resources/lib/libRblas.0.dylib 
## LAPACK: /Library/Frameworks/R.framework/Versions/4.4-x86_64/Resources/lib/libRlapack.dylib;  LAPACK version 3.12.0
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## time zone: America/Chicago
## tzcode source: internal
## 
## attached base packages:
## [1] stats4    stats     graphics  grDevices datasets  utils     methods  
## [8] base     
## 
## other attached packages:
##  [1] dplyr_1.1.4                 ggplot2_3.5.1              
##  [3] SingleCellExperiment_1.26.0 SummarizedExperiment_1.34.0
##  [5] Biobase_2.64.0              GenomicRanges_1.56.0       
##  [7] GenomeInfoDb_1.40.0         IRanges_2.38.0             
##  [9] S4Vectors_0.42.0            BiocGenerics_0.50.0        
## [11] MatrixGenerics_1.16.0       matrixStats_1.3.0          
## 
## loaded via a namespace (and not attached):
##   [1] pROC_1.18.5               rlang_1.1.3              
##   [3] magrittr_2.0.3            clue_0.3-65              
##   [5] GetoptLong_1.0.5          e1071_1.7-14             
##   [7] compiler_4.4.0            DelayedMatrixStats_1.26.0
##   [9] png_0.1-8                 reshape2_1.4.4           
##  [11] vctrs_0.6.5               stringr_1.5.1            
##  [13] shape_1.4.6.1             pkgconfig_2.0.3          
##  [15] crayon_1.5.2              fastmap_1.2.0            
##  [17] XVector_0.44.0            scuttle_1.14.0           
##  [19] labeling_0.4.3            utf8_1.2.4               
##  [21] rmarkdown_2.26            prodlim_2023.08.28       
##  [23] tzdb_0.4.0                UCSC.utils_1.0.0         
##  [25] purrr_1.0.2               bit_4.0.5                
##  [27] xfun_0.44                 zlibbioc_1.50.0          
##  [29] cachem_1.0.8              beachmat_2.20.0          
##  [31] jsonlite_1.8.8            recipes_1.0.10           
##  [33] highr_0.10                DelayedArray_0.30.1      
##  [35] BiocParallel_1.38.0       cluster_2.1.6            
##  [37] parallel_4.4.0            R6_2.5.1                 
##  [39] RColorBrewer_1.1-3        bslib_0.7.0              
##  [41] stringi_1.8.4             parallelly_1.37.1        
##  [43] rpart_4.1.23              lubridate_1.9.3          
##  [45] jquerylib_0.1.4           Rcpp_1.0.12              
##  [47] iterators_1.0.14          knitr_1.46               
##  [49] future.apply_1.11.2       readr_2.1.5              
##  [51] timechange_0.3.0          Matrix_1.7-0             
##  [53] splines_4.4.0             nnet_7.3-19              
##  [55] tidyselect_1.2.1          abind_1.4-5              
##  [57] yaml_2.3.8                timeDate_4032.109        
##  [59] doParallel_1.0.17         codetools_0.2-20         
##  [61] listenv_0.9.1             lattice_0.22-6           
##  [63] tibble_3.2.1              plyr_1.8.9               
##  [65] withr_3.0.0               evaluate_0.23            
##  [67] future_1.33.2             survival_3.5-8           
##  [69] proxy_0.4-27              circlize_0.4.16          
##  [71] pillar_1.9.0              BiocManager_1.30.23      
##  [73] renv_1.0.7                foreach_1.5.2            
##  [75] generics_0.1.3            vroom_1.6.5              
##  [77] rprojroot_2.0.4           hms_1.1.3                
##  [79] sparseMatrixStats_1.16.0  munsell_0.5.1            
##  [81] scales_1.3.0              globals_0.16.3           
##  [83] class_7.3-22              glue_1.7.0               
##  [85] tools_4.4.0               data.table_1.15.4        
##  [87] ModelMetrics_1.2.2.2      gower_1.0.1              
##  [89] forcats_1.0.0             grid_4.4.0               
##  [91] tidyr_1.3.1               ipred_0.9-14             
##  [93] colorspace_2.1-0          nlme_3.1-164             
##  [95] GenomeInfoDbData_1.2.12   cli_3.6.2                
##  [97] fansi_1.0.6               S4Arrays_1.4.0           
##  [99] viridisLite_0.4.2         ComplexHeatmap_2.20.0    
## [101] lava_1.8.0                gtable_0.3.5             
## [103] sass_0.4.9                digest_0.6.35            
## [105] caret_6.0-94              SparseArray_1.4.3        
## [107] rjson_0.2.21              farver_2.1.2             
## [109] htmltools_0.5.8.1         lifecycle_1.0.4          
## [111] hardhat_1.3.1             httr_1.4.7               
## [113] GlobalOptions_0.1.2       bit64_4.0.5              
## [115] MASS_7.3-60.2